1. Introduction

Predictive policing is an analytical approach to forecasting potential criminal activity from related factors, such as location and time. Since public safety resources like policing are limited, it makes sense to allocate them to where crimes are most likely to happen. Machine learning, as a powerful tool, also contributes substantially to crime prediction.

However, predictive policing is controversial even though it is useful. Due to the complexity of the physical environment and of psychological factors, the data I obtain from the real world cannot be perfectly complete. Furthermore, since each piece of crime data is collected when a crime is reported, less severe crimes may go unnoticed or unreported. This likely introduces bias into the data, so that I predict reported crime rather than all potential crime. Bearing these constraints in mind, I should be cautious when building a model.

Broken Windows Theory states that visible signs of civil disorder, such as vandalism, graffiti, and public drinking, create an urban environment that promotes even more crime and disorder (Wilson & Kelling, 1982). In this report, I build a model based on Broken Windows Theory to predict motor vehicle thefts in Chicago.

2. Set up

2.1 Load packages and source functions

knitr::opts_chunk$set(warning = FALSE, echo = TRUE)
# knitr::opts_chunk$set(eval = FALSE) # knit without running codes
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

2.2 Import data

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))
autotheft <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    filter(Primary.Type == "MOTOR VEHICLE THEFT" & Description == "AUTOMOBILE") %>%
    mutate(x = gsub("[()]", "", Location)) %>%  # [ab] means a or b 
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 
## Reading layer `chicagoBoundary' from data source 
##   `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

2.3 Maps of vehicle thefts by point and density

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = autotheft, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Motor vehicle thefts, Chicago 2017") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(autotheft)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of motor vehicle thefts") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

The maps above show where vehicle thefts happened. In my opinion, selection bias is not very severe for stolen-vehicle data, so the broken windows theory may be effective in this case. Since a vehicle is more visible, more valuable, and larger than most other stolen items, the owner of a stolen car will almost always report the loss. What's more, stealing a car requires fewer people in the surroundings, so it depends more on the environment than relatively random crimes like murder or assault.

2.4 Feature engineering - Count of vehicle theft by grid cell

Create fishnet.

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = T) %>%
  .[chicagoBoundary] %>%            # <- MDH Added
  st_sf() %>%
  mutate(uniqueID = rownames(.))

Aggregate theft points to fishnet

## add a value of 1 to each crime, sum them with aggregate
crime_net =
  dplyr::select(autotheft) %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(count = replace_na(count, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 57), # randomly assign each cell to a CV group
                       size=nrow(fishnet), replace = TRUE))


ggplot() +
  geom_sf(data = crime_net, aes(fill = count), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of motor vehicle thefts for the fishnet") +
  mapTheme()

The map above shows the count of thefts by grid cell. A clustered spatial pattern begins to take shape. In Chicago, two obvious dark belts shape the spatial structure of vehicle thefts. One is the Kennedy Expressway, a freeway running southeast–northwest in the north of the city; north of it, counts of stolen vehicles are relatively low. The other dark belt, in the middle of the city and running southwest–northeast, follows the Stevenson Expressway and the nearby Chicago Sanitary and Ship Canal. Between these two belts lies the highest vehicle theft cluster, and south of the Stevenson Expressway is a secondary high cluster. Notably, two of the belts' endpoints are the grid cells with the most vehicle thefts: one is Chicago Midway International Airport in the mid-south and the other is the Fulton River District in the east.

2.5 Feature engineering - Count of risk factors by grid cell

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 
## Reading layer `chicago' from data source 
##   `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84
garbagecarts <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
    mutate(year = substr(creation_date,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Garbage_Carts")

policestation <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Police-Stations/z8bn-74gv") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "police_station")
trafficCrash <- 
  read.socrata("https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if") %>%
    mutate(year = substr(crash_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "trafficCrash")
vars_net <- 
  rbind(abandonCars, abandonBuildings, graffiti, streetLightsOut, sanitation, liquorRetail, trafficCrash, policestation, garbagecarts) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()
## `summarise()` has grouped output by 'uniqueID'. You can override using the
## `.groups` argument.
## Joining, by = "uniqueID"

The small-multiple maps below show the different risk factors aggregated by grid cell. Each risk factor has its own spatial pattern; for example, Abandoned_Buildings clusters mainly in the south while Abandoned_Cars clusters in the north.

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol=3, top="Risk Factors by Fishnet"))

2.6 Feature engineering - Nearest neighbor distance

This second feature engineering approach calculates, for each grid cell centroid, the average distance to its three nearest instances of each risk factor.
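As a concrete sketch of what the sourced nn_function() helper is assumed to compute, the minimal base-R version below averages, for each measure point, the Euclidean distances to its k nearest reference points (the function and data names here are illustrative, not from the sourced script):

```r
# Illustrative base-R version of the assumed nn_function() logic:
# for each measure point, average the distances to its k nearest reference points.
knn_avg_dist <- function(measure_xy, reference_xy, k = 3) {
  apply(measure_xy, 1, function(p) {
    d <- sqrt((reference_xy[, 1] - p[1])^2 + (reference_xy[, 2] - p[2])^2)
    mean(sort(d)[seq_len(k)])  # mean distance to the k nearest references
  })
}

# One cell centroid at the origin; reference points along the x-axis
centroids  <- matrix(c(0, 0), ncol = 2)
references <- matrix(c(1, 0, 2, 0, 3, 0, 10, 0), ncol = 2, byrow = TRUE)
knn_avg_dist(centroids, references, k = 3)  # (1 + 2 + 3) / 3 = 2
```

A small average distance means the risk factor is close by, so the .nn features vary inversely with exposure to the factor.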

st_c <- st_coordinates
st_coid <- st_centroid

trafficCrash.nn =
        nn_function(na.omit(st_c(st_coid(vars_net))), na.omit(st_c(trafficCrash)),3)

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      garbagecarts.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(garbagecarts),3),
      policestation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(policestation),3),
      trafficCrash.nn =
       trafficCrash.nn)
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()
## Joining, by = "uniqueID"
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor Risk Factors by Fishnet"))

3. Exploring the spatial process of vehicle theft

3.1 Local Moran’s I

The figures below show the local spatial autocorrelation of vehicle theft. A higher local Moran's I indicates stronger local spatial clustering. Here, the null hypothesis is that the vehicle theft count at a given location is randomly distributed relative to its immediate neighbors. The output includes several useful statistics: I itself, the p-value, and Significant_Hotspots, defined as those grid cells with higher local counts than would be expected under randomness (here, p-values <= 0.001).
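As a toy illustration of the statistic (a simplified form, not the spdep implementation), local Moran's I for cell i can be written as I_i = (z_i / m2) * sum_j w_ij * z_j, where z is the deviation of the count from its mean, w is a row-standardized weights matrix, and m2 is the mean squared deviation:

```r
# Simplified local Moran's I: positive values mean a cell resembles its neighbors
local_moran_toy <- function(x, W) {
  z  <- x - mean(x)              # deviations from the mean count
  m2 <- sum(z^2) / length(x)     # mean squared deviation
  (z / m2) * as.vector(W %*% z)  # I_i = (z_i / m2) * sum_j w_ij * z_j
}

x <- c(10, 9, 1, 2)  # two high cells adjacent to each other, two low cells adjacent
W <- matrix(c(0, 1, 0, 0,
              1, 0, 0, 0,
              0, 0, 0, 1,
              0, 0, 1, 0), 4, 4, byrow = TRUE)  # row-standardized neighbor pairs
local_moran_toy(x, W)  # all positive: like values cluster next to like values
```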

## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to a list of weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)


# local Moran's I
local_morans <- localmoran(final_net$count, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Vehicle_theft_Count = count, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

vars.moran = unique(final_net.localMorans$Variable)
plots2 = list()
for(i in vars.moran){
  plots2[[i]] = ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="",option="C") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}
grid.arrange(grobs=plots2, ncol = 4, 
             top=textGrob("Local Moran's I statistics", gp=gpar(fontsize=18)))

3.2 Factor correlations

The small multiples below show the correlation between each risk factor and the vehicle theft count.

## important to drop the geometry from joining features

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID,  -name, -District) %>%
    gather(Variable, Value, -count)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, count, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, count)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Vehicle theft count as a function of risk factors") +
  plotTheme()
## `geom_smooth()` using formula 'y ~ x'

4. Poisson Regression and spatial cross validation

4.1 The histogram of vehicle theft count

Below is a histogram of the vehicle theft count. The distribution is right-skewed with many low counts, resembling a Poisson distribution, so I fit the data with a Poisson regression model.

ggplot(final_net) +
  geom_histogram(aes(count),bins=40, color="black") +
  labs(title = "Histogram of vehicle theft count",
       subtitle = "Chicago, 2017") +
  plotTheme()
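As a quick sanity check on why a Poisson model suits skewed count data, the sketch below simulates counts from a known log-linear rate and recovers the coefficients with glm(); the data here are simulated, not the theft counts:

```r
# Poisson regression models log(E[count]) as a linear function of predictors
set.seed(1)
x <- runif(500)
y <- rpois(500, lambda = exp(-1 + 2 * x))  # true log-rate: -1 + 2x
fit <- glm(y ~ x, family = "poisson")
coef(fit)                                  # estimates should be near (-1, 2)
predict(fit, data.frame(x = 0.5), type = "response")  # fitted mean count at x = 0.5
```

predict(..., type = "response") applies exp() to the linear predictor, the same back-transformation that produces the fishnet predictions later on.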

4.2 Model validation

For model validation, I use 'leave-one-group-out' cross-validation (LOGO-CV): hold out one local area, train the model on the other n - 1 areas, predict for the hold-out, and record the goodness of fit. Each neighborhood takes a turn as the hold-out.

final_net <-
  final_net %>% 
  mutate(theft.isSig = 
           ifelse(localmoran(final_net$count, 
                             final_net.weights, zero.policy = TRUE)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(theft.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, theft.isSig == 1))), 1))
reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "trafficCrash.nn","policestation.nn","garbagecarts.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "trafficCrash.nn","policestation.nn","garbagecarts.nn",
              "theft.isSig", "theft.isSig.dist")
crossValidate0 <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(count ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}
## RUN REGRESSIONS
reg.cv <- crossValidate0(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "count",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, count, Prediction, geometry)

reg.ss.cv <- crossValidate0(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "count",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, count, Prediction, geometry)
  
reg.spatialCV <- crossValidate0(
  dataset = final_net,
  id = "name",
  dependentVariable = "count",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, count, Prediction, geometry)

reg.ss.spatialCV <- crossValidate0(
  dataset = final_net,
  id = "name",
  dependentVariable = "count",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, count, Prediction, geometry)
reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - count,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - count,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - count,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - count,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 

The maps below show model errors by random k-fold and spatial cross-validation. They reveal how the errors distribute spatially when the local spatial process is not accounted for: the largest errors occur in the hotspot locations.

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - count, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Vehicle thefts errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

Below are histograms showing the distribution of MAE, with and without the spatial process features.

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") +
    plotTheme()

Below is a table of the mean and standard deviation of MAE by regression.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF") 
Regression                            Mean_MAE  SD_MAE
Random k-fold CV: Just Risk Factors       0.39    0.28
Random k-fold CV: Spatial Process         0.37    0.25
Spatial LOGO-CV: Just Risk Factors        1.44    1.37
Spatial LOGO-CV: Spatial Process          1.34    1.17

theftppp <- as.ppp(st_coordinates(autotheft), W = st_bbox(final_net))
theftKD.1000 <- spatstat.core::density.ppp(theftppp, 1000)
theftKD.1500 <- spatstat.core::density.ppp(theftppp, 1500)
theftKD.2000 <- spatstat.core::density.ppp(theftppp, 2000)
theftKD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(theftKD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(theftKD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(theftKD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

theftKD.df$Legend <- factor(theftKD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=theftKD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(name="Density") +
  labs(title = "Kernel density with 3 different search radii") +
  mapTheme(title_size = 14)

4.3 Validation by racial context

The table below shows raw errors by race context for the random k-fold vs. spatial cross-validation regressions. Here the census tracts are split into two groups: Majority_White, where the percentage of white residents is greater than 50%, and Majority_Non_White otherwise. Since Error here is calculated as predicted count minus observed vehicle theft count, a positive error represents an over-prediction. This division helps determine whether the model over-predicts in minority areas and under-predicts in majority-white areas.

census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE) # use your own Census API key

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]
reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Mean Error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F)  

4.4 Comparing the risk predictions with the traditional hotspot method on the next year's crime

Hotspot kernel density mapping shows the places where crime is most concentrated and is often used for police dispatching. Kernel density works by centering a smooth curve atop each crime point, such that the curve is highest directly over the point and decreases toward the edge of the search radius. Below is a kernel density map generated from the crime points with a 1,000-foot search radius.
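The idea can be sketched with a simple Gaussian-bump estimator (illustrative only; spatstat's density.ppp uses its own kernel, units, and edge corrections):

```r
# Toy 2-D kernel density: sum a Gaussian bump centered on each point,
# with bandwidth h playing the role of the search radius
kde_at <- function(s, pts, h) {
  d2 <- (pts[, 1] - s[1])^2 + (pts[, 2] - s[2])^2
  sum(exp(-d2 / (2 * h^2))) / (2 * pi * h^2 * nrow(pts))
}

pts <- matrix(c(0, 0, 1, 0, 0, 1), ncol = 2, byrow = TRUE)  # three crime points
kde_at(c(0, 0), pts, h = 1) > kde_at(c(5, 5), pts, h = 1)   # density peaks near the points
```

A larger h smooths the surface over a wider area, which is exactly what the 1000/1500/2000 ft. search radii above illustrate.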

theft_ppp <- as.ppp(st_coordinates(autotheft), W = st_bbox(final_net)) # 'spatstat' package can only process ppp class data

theft_KD <- spatstat.core::density.ppp(theft_ppp, 1000)

as.data.frame(theft_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(autotheft, 1500), size = .5,colour = "black" , alpha = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 vehicle thefts") +
     mapTheme()

Now I introduce observed data, vehicle thefts in 2018, as a standard for comparing which of the two methods is better. In the map, black points represent vehicle thefts that happened in 2018. The comparison shows that the high-risk areas in my prediction seemingly fit the spatial shape of the observed data and appear to outperform the traditional kernel density method in terms of spatial pattern.

autotheft18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "MOTOR VEHICLE THEFT" & Description == "AUTOMOBILE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%  # [ab] means a or b 
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]
theft_KDE_sf <- as.data.frame(theft_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(autotheft18) %>% mutate(theftCount = 1), ., sum) %>%
    mutate(theftCount = replace_na(theftCount, 0))) %>%
  dplyr::select(label, Risk_Category, theftCount)

theft_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(autotheft18) %>% mutate(theftCount = 1), ., sum) %>%
      mutate(theftCount = replace_na(theftCount, 0))) %>%
  dplyr::select(label,Risk_Category, theftCount)
rbind(theft_KDE_sf, theft_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(autotheft18, 3000), size = .3, colour = "black") +
    facet_wrap(~label) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 vehicle theft risk predictions; 2018 vehicle thefts") +
    mapTheme()

The bar plot below makes this comparison, computing the share of observed 2018 thefts that fall into each predicted risk category.

rbind(theft_KDE_sf, theft_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(count = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = count / sum(count)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE) +
      labs(title = "Risk prediction vs. Kernel density, 2018 vehicle thefts") +
      plotTheme() + theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
## `summarise()` has grouped output by 'label'. You can override using the
## `.groups` argument.

5. Conclusion

In the bar plot, the prediction outperforms kernel density only slightly in the 50–69% and 90–100% risk categories, a difference that might also be caused by random error. In the 70–89% risk category, kernel density actually works better. Therefore I do not think this model is applicable to real-life policing resource allocation.

For further improvement of the model, I have two ideas. First, adjust the independent variables: test the correlation between each independent variable and the dependent variable, the crime count. Second, the training set should also cover a longer period to reduce as much randomness as possible.